In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.max_rows = 8

Reshaping data with stack and unstack

Pivoting

Data is often stored in CSV files or databases in so-called “stacked” or “record” format:


In [2]:
df = pd.DataFrame({'subject':['A', 'A', 'B', 'B'], 
                   'treatment':['CH', 'DT', 'CH', 'DT'], 
                   'concentration':range(4)},
                 columns=['subject', 'treatment', 'concentration'])
df


Out[2]:
subject treatment concentration
0 A CH 0
1 A DT 1
2 B CH 2
3 B DT 3

A better representation might be one where the different subjects are in rows, the applied treatments are in columns and outcomes are in the data frame values.

You can achieve this with the `pivot` function:


In [3]:
pivoted = df.pivot(index='subject', columns='treatment', values='concentration')
pivoted


Out[3]:
treatment CH DT
subject
A 0 1
B 2 3

If there is more than one record for each pair of "subject" and "treatment" (for example, if the subject was tested twice with the same treatment at different times), you can use pivot_table. It works just like pivot, but additionally lets you specify an aggregation function ('mean' by default).
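As a minimal sketch of this, assuming a hypothetical data frame in which each (subject, treatment) pair was measured twice:

```python
import pandas as pd

# Hypothetical repeated measurements: each (subject, treatment) pair occurs twice
df2 = pd.DataFrame({'subject': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                    'treatment': ['CH', 'CH', 'DT', 'DT', 'CH', 'CH', 'DT', 'DT'],
                    'concentration': [0, 2, 1, 3, 2, 4, 3, 5]})

# pivot would raise an error on the duplicate pairs;
# pivot_table aggregates them (here with the default 'mean')
table = df2.pivot_table(index='subject', columns='treatment',
                        values='concentration', aggfunc='mean')
```

Here `table.loc['A', 'CH']` is the mean of the two CH measurements for subject A.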

To take another example, we will use some data from expeditions to the Pole of Inaccessibility. We will read the data from an SQL database.


In [4]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///data/survey.db')

In [5]:
visited = pd.read_sql('Visited', engine, index_col='ident', parse_dates=['dated'])
visited


Out[5]:
site dated
ident
619 DR-1 1927-02-08
622 DR-1 1927-02-10
734 DR-3 1930-01-07
735 DR-3 1930-01-12
751 DR-3 1930-02-26
752 DR-3 NaT
837 MSK-4 1932-01-14
844 DR-1 1932-03-22

In [6]:
readings = pd.read_sql('Survey', engine).dropna()
readings = readings.drop_duplicates()
readings


Out[6]:
taken person quant reading
0 619 dyer rad 9.82
1 619 dyer sal 0.13
2 622 dyer rad 7.80
3 622 dyer sal 0.09
... ... ... ... ...
17 837 lake rad 1.46
18 837 lake sal 0.21
19 837 roe sal 22.50
20 844 roe rad 11.25

19 rows × 4 columns

EXERCISE: Join the `readings` and `visited` tables.
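As a hint, pandas joins two data frames with `pd.merge` (or `DataFrame.join` on indexes). A minimal sketch on toy frames (the column names here mirror the survey tables, but the data is illustrative):

```python
import pandas as pd

# Toy stand-ins for the two tables: `taken` in readings refers to
# the `ident` index of visited
left = pd.DataFrame({'taken': [619, 622], 'reading': [9.82, 7.80]})
right = pd.DataFrame({'site': ['DR-1', 'DR-1'],
                      'dated': ['1927-02-08', '1927-02-10']},
                     index=pd.Index([619, 622], name='ident'))

# merge a column of `left` against the index of `right`
joined = pd.merge(left, right, left_on='taken', right_index=True)
```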

In [ ]:

EXERCISE: Pivot the table such that we have sites in rows and different quantities in columns.

In [ ]:

Hierarchical index

A hierarchical index in pandas is a way of introducing another dimension to a (two-dimensional) data frame. It is implemented by having multiple levels of the index. Let's look at an example.


In [9]:
multi = df.set_index(['subject', 'treatment'])
multi


Out[9]:
concentration
subject treatment
A CH 0
DT 1
B CH 2
DT 3

Note how the two indexes are nested: the second-level index ('treatment') is grouped under the first-level index ('subject'). You can access rows using a label from the first level alone, or from both levels using a tuple.


In [10]:
multi.loc['A'] # first level only


Out[10]:
concentration
treatment
CH 0
DT 1

Note that this returns a standard data frame with a "flat" index.


In [11]:
multi.loc[('A', 'CH')] # both levels


Out[11]:
concentration    0
Name: (A, CH), dtype: int64

Indexing on the second level only is slightly more involved:


In [12]:
multi.loc[(slice(None), 'CH'), :]


Out[12]:
concentration
subject treatment
A CH 0
B CH 2

Consult the documentation for other methods.
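One such method worth mentioning: `pd.IndexSlice` is a more readable alternative to `slice(None)`. A minimal sketch on the same data frame:

```python
import pandas as pd

df = pd.DataFrame({'subject': ['A', 'A', 'B', 'B'],
                   'treatment': ['CH', 'DT', 'CH', 'DT'],
                   'concentration': range(4)})
multi = df.set_index(['subject', 'treatment'])

# idx[:, 'CH'] reads as "all subjects, treatment CH"
idx = pd.IndexSlice
subset = multi.loc[idx[:, 'CH'], :]
```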

To return to the original format, with columns instead of indexes, use reset_index:


In [13]:
multi.reset_index()


Out[13]:
subject treatment concentration
0 A CH 0
1 A DT 1
2 B CH 2
3 B DT 3
EXERCISE: Group the survey data by site, date of measurement at each site, and the quantity measured. Using the hierarchical index, list all readings for `site` DR-1 and all readings of radiation.

In [ ]:

stack/unstack

stack — shifts the last level of the (hierarchical) column labels into the row index

unstack — does the opposite, i.e. shifts the last level of the (hierarchical) row index into the columns


In [15]:
result = multi['concentration'].unstack()
result


Out[15]:
treatment CH DT
subject
A 0 1
B 2 3

stack reverses the operation:


In [16]:
result.stack()


Out[16]:
subject  treatment
A        CH           0
         DT           1
B        CH           2
         DT           3
dtype: int64

We can "stack" it even further:


In [17]:
df = multi.stack()
df


Out[17]:
subject  treatment               
A        CH         concentration    0
         DT         concentration    1
B        CH         concentration    2
         DT         concentration    3
dtype: int64
EXERCISE: Rearrange the data frame from the last exercise, such that rows contain sites and dates (hierarchical index) and columns the different quantities. List all readings of radiation.

In [ ]:

Formatting data — Case study

Going further with the time series case study, we will work on data from AirBase (The European Air quality dataBase).

One of the actual downloaded raw data files of AirBase is included in the repo:


In [19]:
!head -1 ./data/BETR8010000800100hour.1-1-1990.31-12-2012


1990-01-01	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0

Just reading the tab-delimited data:


In [20]:
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')#, header=None)

data.head()


Out[20]:
1990-01-01 -999.000 0 -999.000.1 0.1 -999.000.2 0.2 -999.000.3 0.3 -999.000.4 ... -999.000.19 0.19 -999.000.20 0.20 -999.000.21 0.21 -999.000.22 0.22 -999.000.23 0.23
0 1990-01-02 -999 0 -999 0 -999 0 -999 0 -999 ... 57 1 58 1 54 1 49 1 48 1
1 1990-01-03 51 1 50 1 47 1 48 1 51 ... 84 1 75 1 -999 0 -999 0 -999 0
2 1990-01-04 -999 0 -999 0 -999 0 -999 0 -999 ... 69 1 65 1 64 1 60 1 59 1
3 1990-01-05 51 1 51 1 48 1 50 1 51 ... -999 0 -999 0 -999 0 -999 0 -999 0
4 1990-01-06 -999 0 -999 0 -999 0 -999 0 -999 ... -999 0 -999 0 -999 0 -999 0 -999 0

5 rows × 49 columns

The above data is clearly not ready to be used! Each row contains the 24 measurements for each hour of the day, and also contains a flag (0/1) indicating the quality of the data.

Let's replace the negative numbers with missing values and give the columns proper names.


In [21]:
hours = map(str, range(24))  # hour labels: '0' .. '23'
flags = ['flag'] * 24
# interleave hour labels with 'flag': ['0', 'flag', '1', 'flag', ...]
col_names = ['date'] + list(sum(zip(hours, flags), ()))
col_names[:5]


Out[21]:
['date', '0', 'flag', '1', 'flag']

In [22]:
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t',
                   na_values=['-999', '-9999'],
                   names=col_names,
                   index_col='date')#, header=None)

For now, we disregard the 'flag' columns:


In [23]:
data = data.drop('flag', axis=1)
data.head()


Out[23]:
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
date
1990-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1990-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN 48 ... 55 59 58 59 58 57 58 54 49 48
1990-01-03 51 50 47 48 51 52 58 57 NaN NaN ... 69 74 NaN NaN 103 84 75 NaN NaN NaN
1990-01-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 71 74 70 70 69 65 64 60 59
1990-01-05 51 51 48 50 51 58 65 66 69 74 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 24 columns

Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime-index.

stack at work

We can now use stack and some other functions to create a timeseries from the original dataframe:

EXERCISE: Reshape the dataframe to a timeseries

The end result should look like:

BETR801
1990-01-02 09:00:00 48.0
1990-01-02 12:00:00 48.0
1990-01-02 13:00:00 50.0
1990-01-02 14:00:00 55.0
... ...
2012-12-31 20:00:00 16.5
2012-12-31 21:00:00 14.5
2012-12-31 22:00:00 16.5
2012-12-31 23:00:00 15.0

170794 rows × 1 columns

First, reshape the dataframe so that each row consists of one observation for one date + hour combination:


In [ ]:

Now, combine the date and hour columns into a datetime (tip: string columns can be summed to concatenate the strings):
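The string-concatenation tip can be sketched on a toy frame (the column and value names here are illustrative, not the actual reshaped data):

```python
import pandas as pd

# Hypothetical long-format rows: one observation per date + hour combination
long = pd.DataFrame({'date': ['1990-01-02', '1990-01-02'],
                     'hour': ['9', '22'],
                     'value': [48.0, 49.0]})

# concatenate the string columns with +, then parse with pd.to_datetime
stamps = pd.to_datetime(long['date'] + ' ' + long['hour'] + ':00')
ts = long.set_index(stamps)[['value']]
```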


In [ ]:


In [ ]:

Acknowledgement

© 2015, Stijn Van Hoey and Joris Van den Bossche (mailto:stijnvanhoey@gmail.com, mailto:jorisvandenbossche@gmail.com).

© 2015, modified by Bartosz Teleńczuk (original sources available from https://github.com/jorisvandenbossche/2015-EuroScipy-pandas-tutorial)

Licensed under CC BY 4.0 Creative Commons